In [1]:

    
using DataFrames









    



INFO: Precompiling module DataFrames...



In [9]:

    
df = readtable("data.csv");

DataFrames

DataFrame Methods

There are several simple functions you can use to inspect a DataFrame, such as size and names.



In [10]:

    
size(df)









    Out[10]:





(250000,20)



In [11]:

    
names(df)









    Out[11]:





20-element Array{Symbol,1}:
 :timestamp             
 :page_group            
 :geo_cc                
 :geo_rg                
 :geo_org               
 :geo_netspeed          
 :user_agent_family     
 :user_agent_major      
 :user_agent_minor      
 :user_agent_os         
 :user_agent_osversion  
 :user_agent_device_type
 :user_agent_model      
 :params_dom_sz         
 :params_dom_ln         
 :params_dom_script     
 :params_dom_img        
 :timers_t_done         
 :timers_t_resp         
 :timers_t_page

Each column of a DataFrame is a DataArray

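You can reference a column by subscripting the DataFrame with the column name as a Symbol. Putting a row range before the column, as in df[30:40, :timers_t_done], restricts the result to those rows. A DataArray is just a regular array that can also contain NA, which is Juliaspeak for NULL.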



In [12]:

    
df[:timers_t_done]









    Out[12]:





250000-element DataArrays.DataArray{Int64,1}:
  6257
  5955
 14750
 12266
 10773
  6604
  3502
  6073
  6554
  6546
 12472
 10238
 10995
     ⋮
  7827
  6673
  6624
  6189
  4699
  2968
  7348
  7265
  8626
  3756
  3836
  4439



In [13]:

    
df[30:40, :timers_t_done]









    Out[13]:





11-element DataArrays.DataArray{Int64,1}:
  2857
  3056
  5124
  3188
  4841
  4680
  4879
  6106
  4516
  5557
 12049



In [14]:

    
df[30:40, [:timestamp, :geo_cc, :geo_netspeed, :user_agent_family, :timers_t_done]]









    Out[14]:




Row  timestamp      geo_cc  geo_netspeed  user_agent_family  timers_t_done
1    1455611221592  PL      NA            IE                 2857
2    1455611283782  PL      NA            IE                 3056
3    1455611355679  PL      NA            IE                 5124
4    1455612940770  GB      Cable/DSL     Safari             3188
5    1455613685100  IE      Dialup        IE                 4841
6    1455613730994  IE      Dialup        IE                 4680
7    1455614657272  UA      NA            Firefox            4879
8    1455614335263  UA      NA            Firefox            6106
9    1455614452250  UA      NA            Firefox            4516
10   1455612605862  GB      Cable/DSL     Safari             5557
11   1455613527347  SA      NA            Chrome             12049

Stats on DataFrames

Most Julia stats functions accept any AbstractArray, which is the base type for Array as well as DataArray, so you can run them on any column of a DataFrame that contains numbers. If a column contains NAs, you will probably need to remove them first using the dropna function.

Our test dataset doesn't contain any NA values for the timers_t_done column, so we're safe.



In [15]:

    
summarystats(df[:timers_t_done])









    Out[15]:





Summary Stats:
Mean:         5858.974556
Minimum:      8.000000
1st Quartile: 2357.000000
Median:       3973.000000
3rd Quartile: 6688.000000
Maximum:      2536087.000000

Histograms

By default, the hist function splits the dataset into equal-width buckets spanning the data's range. This may not always be what you want, so you can pass in a list of thresholds as the second parameter.

The hist function returns a tuple. The first element is the thresholds used, which might be a Range object or an Array. The second element is the list of bucket frequencies.



In [16]:

    
hist(df[:timers_t_done])









    Out[16]:





(0.0:200000.0:2.6e6,[249866,84,44,2,1,1,0,1,0,0,0,0,1])

Creating thresholds based on the data

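With the default buckets, 249,866 of the 250,000 values land in the first bucket, so the histogram tells us very little. We could use static thresholds, but that wouldn't adapt to different data sets. Instead, we develop a Julia function that determines thresholds based on the dataset.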

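Rather than divide the entire range into a fixed set of buckets, we divide the range between two fences placed 1.5 times the Inter-Quartile Range below the first quartile and above the third quartile, clamped to the observed minimum and maximum. This has the advantage of excluding outliers from the basic range. Outliers above the upper fence then get a bucket of their own that extends to the maximum. The function adds no matching bucket below the lower fence; for this dataset the lower fence falls below the minimum, so there are no low outliers.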

This is very similar to a box and whiskers plot.



In [17]:

    
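# Function to set histogram thresholds after dropping outliers based on IQR:
# 25 equal-width buckets between the fences, plus one bucket for high outliers
function getSymmetricThresholds(results::DataFrame; timer::Symbol=:timers_t_done)
    summary = summarystats(results[timer])

    # Fence width: 1.5 times the inter-quartile range
    fw = (summary.q75 - summary.q25) * 1.5

    # Clamp the fences to the observed minimum and maximum
    low = round(Int64, max(summary.min, summary.q25 - fw))
    high = round(Int64, min(summary.max, summary.q75 + fw)) + 1

    thresholds = Int64[]

    nthresholds = 25

    span = high - low    # named span to avoid shadowing Base.range

    # Lower edge of each regular bucket
    for i in 0:nthresholds-1
        push!(thresholds, round(Int64, low + i * span / nthresholds))
    end

    # Upper edge of the last regular bucket
    push!(thresholds, high)

    # One extra bucket for outliers above the upper fence
    if high < round(Int64, summary.max)
        push!(thresholds, round(Int64, summary.max))
    end

    return thresholds
end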









    Out[17]:





getSymmetricThresholds (generic function with 1 method)

Julia Functions

Notice that Julia functions are declared using the function keyword. Function parameters may have types attached to them; this is optional, and mainly useful when you define several methods under the same function name.

Functions may have keyword parameters, which are optional when they have default values; a ; separates positional parameters from keyword ones.

Keyword parameters must be passed by name, and their order doesn't matter.

A function returns a single value, though that value may be a tuple of multiple objects. The caller can then receive the return value into a single tuple or destructure it into multiple variables, optionally enclosed in ().
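As a minimal sketch of these rules, the hypothetical function load_time_range below (not part of the notebook) takes one positional parameter and one keyword parameter, and returns a tuple that the caller either keeps whole or destructures.

```julia
# Hypothetical helper: `timings` is positional, `scale` is a keyword with a default.
function load_time_range(timings; scale::Int=1)
    return (minimum(timings) * scale, maximum(timings) * scale)
end

# Keep the return value as a single tuple: (2857, 5124)
both = load_time_range([2857, 3056, 5124])

# Or destructure it into two variables, passing the keyword by name: 5714 and 10248
fastest, slowest = load_time_range([2857, 3056, 5124], scale=2)
```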



In [18]:

    
thresholds = getSymmetricThresholds(df)









    Out[18]:





27-element Array{Int64,1}:
       8
     535
    1062
    1589
    2116
    2643
    3170
    3698
    4225
    4752
    5279
    5806
    6333
       ⋮
    7914
    8441
    8968
    9495
   10023
   10550
   11077
   11604
   12131
   12658
   13185
 2536087
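
The thresholds in Out[18] can be checked by hand from the summary statistics in Out[15]. The sketch below repeats the function's arithmetic with those printed values hard-coded. It assumes that round sends a tie such as 13184.5 to the even neighbour, which is what current Julia does and what the 13185 in Out[18] implies; the variable names are illustrative.

```julia
# Values copied from the summary statistics in Out[15]
q25, q75 = 2357.0, 6688.0
dmin, dmax = 8.0, 2536087.0

fw = (q75 - q25) * 1.5                        # 4331 * 1.5 = 6496.5
low = round(Int64, max(dmin, q25 - fw))       # max(8, -4139.5) -> 8
high = round(Int64, min(dmax, q75 + fw)) + 1  # round(13184.5) + 1 -> 13185

# First four bucket edges: 25 buckets across high - low = 13177
edges = [round(Int64, low + i * (high - low) / 25) for i in 0:3]  # 8, 535, 1062, 1589
```

Because 13185 is smaller than the maximum, the function appends 2536087 as a final threshold, giving 27 thresholds and 26 buckets.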

Running the hist function with our new thresholds gives a much finer-grained view of the data.



In [19]:

    
hist_global = hist(df[:timers_t_done], thresholds)[2]









    Out[19]:





26-element Array{Int64,1}:
   252
  6337
 19357
 25199
 24620
 21891
 18662
 16302
 14786
 12989
 11284
  9803
  8566
  7237
  6349
  5424
  4757
  4204
  3728
  3096
  2669
  2429
  2098
  1729
  1529
 14702
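
The counts in Out[19] add up to 249,999, one fewer than the 250,000 rows in the DataFrame. That is consistent with buckets that are open on the left and closed on the right, so the single minimum value 8, which equals the first threshold, falls outside the first bucket. The sketch below spells out that counting rule in plain Julia. count_buckets is a hypothetical helper written for illustration, not the hist implementation, and the interval convention is an assumption inferred from the totals.

```julia
# Hypothetical helper: count values into buckets (edges[i], edges[i+1]].
# A value equal to the first edge, or above the last edge, is not counted.
function count_buckets(values, edges)
    counts = zeros(Int, length(edges) - 1)
    for v in values
        for i in 1:length(edges)-1
            if edges[i] < v <= edges[i+1]
                counts[i] += 1
                break
            end
        end
    end
    return counts
end

# 8 equals the first edge and is dropped; 535 sits on an upper edge and is kept
sample_counts = count_buckets([8, 10, 535, 536, 2000], [8, 535, 1062, 2536087])  # [2, 1, 1]
```

If the minimum should be counted too, one option is to start the thresholds one unit below the lowest value.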

Filtering DataFrames

We can also filter a DataFrame on the value of one or more fields. In the following example, we keep the rows whose :geo_cc is not NA and is equal to US. The NA check is needed because a comparison against NA produces NA rather than true or false.



In [22]:

    
results_US = df[!isna(df[:geo_cc]) & (df[:geo_cc] .== "US"), :];



In [23]:

    
hist_US = hist(results_US[:timers_t_done], thresholds)[2]









    Out[23]:





26-element Array{Int64,1}:
   249
  6243
 19010
 24561
 23830
 21014
 17859
 15578
 14051
 12363
 10758
  9309
  8103
  6823
  6006
  5101
  4462
  3903
  3483
  2896
  2473
  2265
  1926
  1584
  1419
 13249

Statistical Correlation

The cor function computes the Pearson correlation coefficient between the two histograms that we have.



In [24]:

    
cor(hist_global, hist_US)









    Out[24]:





0.9995864880638793

We could also run cumsum to turn each histogram into cumulative counts (an unnormalized CDF) and correlate those values.



In [25]:

    
cor(cumsum(hist_global), cumsum(hist_US))









    Out[25]:





0.9999733633808021
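
A small sketch of the cumsum step, using the first four counts from Out[19]. Dividing the running counts by the total gives a CDF that ends at 1; that rescaling does not change the correlation, since cor is unaffected by multiplying a series by a positive constant. The names running and cdf are illustrative.

```julia
# First four bucket counts from Out[19]
counts = [252, 6337, 19357, 25199]

# Running totals: 252, 6589, 25946, 51145
running = cumsum(counts)

# Normalize so the last element is exactly 1.0
cdf = running / sum(counts)
```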

Splitting/Grouping a DataFrame

Use the by function to run an aggregation on a DataFrame grouped by one or more columns. When the aggregation returns a plain value, the result column gets the default name x1.



In [26]:

    
by(df, :user_agent_family, rows -> median(rows[:timers_t_done]))









    Out[26]:




Row  user_agent_family             x1
1    (Unknown)                     3740.0
2    AOL                           4857.0
3    Amazon Silk                   7599.0
4    Android Browser               11886.0
5    BlackBerry WebKit             8684.0
6    Chrome                        3129.0
7    Chrome Frame                  4067.0
8    Chrome Mobile                 6776.0
9    Chrome Mobile iOS             4257.0
10   Chromium                      2772.0
11   Edge                          3187.0
12   Firefox                       3412.0
13   Firefox Alpha                 3747.0
14   Firefox Beta                  10278.0
15   Firefox Mobile                8456.0
16   Halebot                       4654.0
17   IE                            2862.0
18   IE Mobile                     6265.5
19   Iron                          2738.0
20   Maxthon                       3961.0
21   Mobile Safari                 4147.0
22   Nokia Services (WAP) Browser  11402.0
23   Opera                         12211.0
24   Opera Coast                   6091.0
25   Opera Mini                    3835.0
26   Opera Mobile                  12235.0
27   Other                         4855.5
28   Pale Moon (Firefox Variant)   4797.5
29   PhantomJS                     1660.5
30   Puffin                        922.0
⋮    ⋮                             ⋮
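
What by does here is often described as split, apply, combine. The sketch below walks through those three steps with a Dict and Base functions only, using five (browser, load time) pairs borrowed from Out[14]. The names groups, middle_value and result are hypothetical, and middle_value is a hand-written median so that the sketch needs no packages.

```julia
# Five (browser, load time in ms) pairs borrowed from Out[14]
rows = [("IE", 2857), ("IE", 3056), ("Safari", 3188), ("IE", 4841), ("Firefox", 4879)]

# Split: collect the timings for each browser
groups = Dict{String, Vector{Int}}()
for (browser, t) in rows
    push!(get!(groups, browser, Int[]), t)
end

# Apply: median of a non-empty vector (sorts a copy, leaves the input untouched)
function middle_value(v)
    s = sort(v)
    n = length(s)
    return isodd(n) ? float(s[div(n + 1, 2)]) : (s[div(n, 2)] + s[div(n, 2) + 1]) / 2
end

# Combine: one (browser, median) pair per group, ordered by browser name
result = [(b, middle_value(groups[b])) for b in sort(collect(keys(groups)))]
```

For these five rows, result pairs Firefox with 4879.0, IE with 3056.0 and Safari with 3188.0.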

Problems if the aggregation function returns an array

If the aggregation function returns an array, like the hist function does, then we'll actually end up with one row per array element. Instead we need to serialize the array to a string or create a custom data type that encapsulates the array. The string method is easier albeit a little slower, but if we're going to export our data to JavaScript, we may need to do this anyway.



In [27]:

    
using JSON  # provides JSON.json

by(
    df,
    :user_agent_family, 
    rows -> DataFrame(
        count = size(rows, 1),
        median = median(rows[:timers_t_done]),
        hist = JSON.json(hist(rows[:timers_t_done], thresholds)[2])
    )
)









    Out[27]:




Row  user_agent_family             count  median   hist
1    (Unknown)                     73     3740.0   [0,1,8,9,6,5,7,5,0,2,1,2,4,0,0,3,1,5,3,1,2,2,0,1,0,5]
2    AOL                           15     4857.0   [0,0,1,1,1,1,1,0,2,2,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,2]
3    Amazon Silk                   2323   7599.0   [0,0,0,1,15,39,84,117,133,147,145,165,155,113,141,113,108,90,78,70,67,76,53,49,49,315]
4    Android Browser               1752   11886.0  [0,1,0,0,0,3,17,29,43,50,66,63,76,65,80,46,64,55,52,48,44,41,52,46,32,779]
5    BlackBerry WebKit             43     8684.0   [0,0,0,0,1,1,1,0,2,1,2,1,3,3,4,1,3,0,2,1,0,2,2,0,1,12]
6    Chrome                        53086  3129.0   [65,2116,6133,7217,6356,5033,3894,3216,2735,2225,1788,1558,1398,1164,993,872,711,606,563,464,437,342,322,275,246,2357]
7    Chrome Frame                  37     4067.0   [0,0,1,4,4,5,3,4,4,2,1,0,1,0,2,0,1,1,0,1,0,1,0,0,0,2]
8    Chrome Mobile                 31477  6776.0   [0,1,23,60,230,594,1106,1698,2232,2663,2709,2494,2268,2082,1793,1607,1356,1140,997,835,671,612,504,402,369,3031]
9    Chrome Mobile iOS             1987   4257.0   [0,22,86,184,182,198,179,132,107,101,82,69,60,48,42,36,37,39,39,37,26,21,18,31,17,194]
10   Chromium                      5      2772.0   [0,0,0,0,2,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]
11   Edge                          6150   3187.0   [3,151,632,843,800,633,513,434,353,272,225,182,149,122,104,88,72,74,54,42,62,39,43,26,24,210]
12   Firefox                       11984  3412.0   [14,372,1051,1392,1398,1255,1085,832,803,575,548,383,322,267,220,156,148,115,134,99,74,82,64,38,52,505]
13   Firefox Alpha                 1      3747.0   [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
14   Firefox Beta                  9      10278.0  [0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,1,3]
15   Firefox Mobile                204    8456.0   [0,0,0,0,2,4,5,8,8,9,12,13,11,13,6,10,9,10,13,7,6,10,8,7,3,30]
16   Halebot                       10     4654.0   [0,0,0,2,1,1,0,0,2,1,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,0]
17   IE                            35360  2862.0   [30,1305,4636,5553,4614,3543,2714,2141,1811,1395,1151,971,782,688,581,412,350,335,274,253,201,186,156,106,93,1079]
18   IE Mobile                     226    6265.5   [0,0,0,0,0,2,6,10,15,18,34,30,14,15,10,6,13,5,7,2,4,4,1,2,1,27]
19   Iron                          10     2738.0   [0,0,0,1,3,3,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
20   Maxthon                       6      3961.0   [0,0,0,0,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0]
21   Mobile Safari                 87775  4147.0   [32,1018,4282,7212,8901,9018,7763,6621,5638,4906,4083,3472,2948,2340,2165,1890,1706,1601,1398,1157,1018,937,805,704,597,5562]
22   Nokia Services (WAP) Browser  1      11402.0  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
23   Opera                         123    12211.0  [0,0,1,2,2,4,9,6,3,5,4,4,4,1,4,1,4,1,2,0,1,0,3,2,1,59]
24   Opera Coast                   3      6091.0   [0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0]
25   Opera Mini                    7      3835.0   [0,0,0,1,1,0,1,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
26   Opera Mobile                  9      12235.0  [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,2,0,3]
27   Other                         86     4855.5   [0,0,0,3,3,5,7,8,15,14,4,4,4,1,1,1,1,1,1,1,1,1,1,1,0,8]
28   Pale Moon (Firefox Variant)   18     4797.5   [0,0,0,1,1,2,0,4,1,1,2,4,0,1,0,0,0,1,0,0,0,0,0,0,0,0]
29   PhantomJS                     78     1660.5   [0,7,27,26,10,7,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
30   Puffin                        5      922.0    [0,4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
⋮    ⋮                             ⋮      ⋮        ⋮
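
For a flat array of integers, the same bracketed text can also be built with join from Base, which may be handy when the JSON package is not available. This is only a sketch for integer arrays like the bucket counts here; strings or nested data need a real JSON encoder. The name serialized is illustrative.

```julia
# First four bucket counts of the (Unknown) row
counts = [0, 1, 8, 9]

# Join the elements with commas and wrap them in brackets: "[0,1,8,9]"
serialized = "[" * join(counts, ",") * "]"
```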

Copy the JSON to a JavaScript file when first testing D3 code

It's easier to start your D3 experimentation with a standalone file rather than within the IJulia interface. A simpler dev setup is easier to debug.



In [28]:

    
println("Histogram:\n", JSON.json(hist_global))
println()
println("Thresholds:\n", JSON.json(thresholds))









    



Histogram:
[252,6337,19357,25199,24620,21891,18662,16302,14786,12989,11284,9803,8566,7237,6349,5424,4757,4204,3728,3096,2669,2429,2098,1729,1529,14702]

Thresholds:
[8,535,1062,1589,2116,2643,3170,3698,4225,4752,5279,5806,6333,6860,7387,7914,8441,8968,9495,10023,10550,11077,11604,12131,12658,13185,2536087]


